Identification of Multiword Expressions in the brWaC
نویسندگان
چکیده
Although corpus size is a well known factor that affects the performance of many NLP tasks, for many languages large freely available corpora are still scarce. In this paper we describe one effort to build a very large corpus for Brazilian Portuguese, the brWaC, generated following the Web as Corpus kool yinitiative. To indirectly assess the quality of the resulting corpus we examined the impact of corpus origin in a specific task, the identification of Multiword Expressions with association measures, against a standard corpus. Focusing on nominal compounds, the expressions obtained from each corpus are of comparable quality and indicate that corpus origin has no impact
منابع مشابه
Parsing Models for Identifying Multiword Expressions
Multiword expressions lie at the syntax/semantics interface and have motivated alternative theories of syntax like Construction Grammar. Until now, however, syntactic analysis and multiword expression identification have been modeled separately in natural language processing. We develop two structured prediction models for joint parsing and multiword expression identification. The first is base...
متن کاملIdentifying Portuguese Multiword Expressions using Different Classification Algorithms - A Comparative Analysis
This paper presents a comparative analysis based on different classification algorithms and tools for the identification of Portuguese multiword expressions. Our focus is on two-word expressions formed by nouns, adjectives and verbs. The candidates are selected on the basis of the frequency of the bigrams; then on the basis of the grammatical class of each bigram’s constituent words. This analy...
متن کاملDomain-Dependent Identification of Multiword Expressions
The identification of different kinds of multiword expressions require different solutions, on the other hand, there might be domain-related differences in their frequency and typology. In this paper, we show how our methods developed for identifying noun compounds and light verb constructions can be adapted to different domains and different types of texts. Our results indicate that with littl...
متن کاملA data-driven approach to verbal multiword expression detection. PARSEME Shared Task system description paper
Multiword expressions are groups of words acting as a morphologic, syntactic and semantic unit in linguistic analysis. Verbal multiword expressions represent a subgroup of multiword expressions, namely that in which a verb is the syntactic head of the group considered in its canonical (or dictionary) form. All multiword expressions are a great challenge for natural language processing, but the ...
متن کاملUnsupervised Construction of a Lexicon and a Repository of Variation Patterns for Arabic Modal Multiword Expressions
We present an unsupervised approach to build a lexicon of Arabic Modal Multiword Expressions (AM-MWEs) and a repository of their variation patterns. These novel resources are likely to boost the automatic identification and extraction of AM-MWEs.
متن کامل